# GPU Computing: Introduction to CUDA

Dr Paul Richmond http://paulrichmond.shef.ac.uk





#### This lecture

- □CUDA Programming Model
- ☐ CUDA Device Code
- □CUDA Host Code and Memory Management
- □ CUDA Compilation



# Programming a GPU with CUDA













# Simple processing flow



### Simple processing flow





# Simple processing flow





### Stream Computing



- ☐ Data set decomposed into a **stream** of elements
- ☐ A single computational function (kernel) operates on each element
  - ☐A **thread** is the execution of a kernel on one data element
- ☐ Multiple Streaming Multiprocessor cores can operate on multiple elements in parallel
  - ☐ Many parallel threads
- ☐ Suitable for **Data Parallel** problems





☐ How does the stream competing principle map to the with the hardware model?







#### CUDA Software Model

- ☐ Hardware abstracted as a **Grid** of **Thread Blocks** 
  - ☐Blocks map to SMPs
  - ☐ Each thread maps onto a CUDA core
- ☐ Don't need to know the hardware characteristics
  - ☐ Code is portable across different GPU architectures







### CUDA Vector Types

□CUDA Introduces a new dim types. E.g. dim2, dim3, dim4
□dim3 contains a collection of three integers (X, Y, Z)

```
dim3 my_xyz (x_value, y_value, z_value);
```

□ Values are accessed as members

```
int x = my_xyz.x;
```



### Special dim3 Vectors

- threadIdx ☐ The location of a thread within a block. E.g. (2,1,0)DblockIdx
  - $\Box$ The location of a block within a grid. E.g. (1,0,0)
- □blockDim
  - $\Box$ The dimensions of the blocks. E.g. (3,9,1)
- □ gridDim
  - $\Box$ The dimensions of the grid. E.g. (3,2,1)

Idx values use zero indices, Dim values represent a size













### Analogy

- ☐ Students arrive at halls of residence to check in
  - ☐ Rooms allocated in order
- ☐ Unfortunately admission rates are down!
  - □Only half as many students as rooms
  - $\square$  Each student can be moved from room i to room 2i so that no-one has a neighbour



#### Serial Solution

- ☐ Receptionist performs the following tasks
  - 1. Asks each student their assigned room number
  - 2. Works out their new room number
  - 3. Informs them of their new room number







### Parallel Solution

"Everybody check your room number. Multiply it by 2 and go to that room"





□CUDA Programming Model

☐ CUDA Device Code

□ CUDA Host Code and Memory Management

□ CUDA Compilation



# A First CUDA Example

☐ Serial solution

```
for (i=0;i<N;i++) {
  result[i] = 2*i;
}</pre>
```

☐ We can parallelise this by assigning each iteration to a CUDA thread!



### CUDA C Example: Device

```
global void myKernel(int *result)
{
  int i = threadIdx.x;
  result[i] = 2*i;
}
```

- ☐ Replace loop with a "kernel"
  - ☐ Use \_\_global\_\_ specifier to indicate it is GPU code
- ☐ Use threadIdx dim variable to get a unique index
  - ☐ Assuming for simplicity we have only one block
  - ☐ Equivalent to your door number at CUDA Halls of Residence





### CUDA C Example: Host

☐ Call the kernel by using the CUDA kernel launch syntax ☐ kernel<<<GRID OF BLOCKS, BLOCK OF THREADS>>>(arguments);

myKernel<<<br/>blocksPerGrid, threadsPerBlock>>> (result);







# Vector Addition Example

☐ Consider a more interesting example

 $\square$  Vector addition: e.g. a + b = c





### Vector Addition Example

```
//Kernel Code
__global__ void vectorAdd(float *a, float *b, float *c)
 int i = threadIdx.x;
  c[i] = a[i] + b[i];
//Host Code
dim3 blocksPerGrid(1,1,1);
dim3 threadsPerBlock(N,1,1); //single block of threads
vectorAdd<<<<ble>blocksPerGrid, threadsPerBlock>>>(a, b, c);
```





### CUDA C Example: Host

- □Only one block will give poor performance
  - ☐ A block gets allocated to a single SMP!
  - ☐ Solution: Use multiple blocks

```
dim3 blocksPerGrid(N/8,1,1); //assumes 8 divides N exactly dim3 threadsPerBlock(8,1,1); //8 threads in the block
```

myKernel<<<br/>blocksPerGrid, threadsPerBlock>>>(result);

```
Thread

Block

Grid

...
```





### Vector Addition Example

```
threadIdx.x threadIdx.x threadIdx.x threadIdx.x

01234567 01234567 01234567 ... 01234567

blockIdx.x = 0 blockIdx.x = 1 blockIdx.x = 2 blockIdx.x = N-1
```

```
//Kernel Code
__global__ void vectorAdd(float *a, float *b, float *c)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  c[i] = a[i] + b[i];
}
```

 $lue{}$  The integer i gives a unique thread Index used to access a unique value from the vectors a, b and c





#### A note on block sizes

- ☐ Thread block sizes can not be larger that 1024
- $\square$  Max grid size is 2147483647 for 1D
  - ☐Grid y and z dimensions are limited to 65535
- ☐Block size should ideally be divisible by 32
  - ☐ This is the warp size in which threads are scheduled
  - □Not less than 32 as in our trivial example!
- □ Varying the block size will result in different performance characteristics
  - $\square$ Try incrementing by values of 32 and benchmark.
- $\square$  Calling a kernel with scalar parameters assumes a 1D grid of thread blocks.
  - $\square$ E.g. my kernel<<<8, 128>>> (arguments);





#### Device functions

```
    □ Kernels are always prefixed with _global_
    □ To call a function from a kernel the function must be a device function (i.e. it must be compiled for the GPU device)
    □ A device function must be prefixed with _device_
    □ A device function is not available from the host
    □ Unless it is also prefixed with _host_
```

```
int increment(int a) { return a + 1; }

__device__ int increment(int a) { return a + 1; }

__device__ __host__ int increment(int a) { return a + 1; }
```

Host only

Device only

Host and device





□CUDA Programming Model

☐ CUDA Device Code

□ CUDA Host Code and Memory Management

□ CUDA Compilation



### Memory Management

- ☐GPU has separate dedicated memory from the host CPU
- ☐ Data accessed in kernels must be on GPU memory
  - ☐ Data must be explicitly copied and transferred
- ☐cudaMalloc() is used to allocate memory on the GPU
- □cudaFree() releases memory

```
float *a;
cudaMalloc(&a, N*sizeof(float));
...
cudaFree(a);
```





### Memory Copying

- ☐Once memory has been allocated we need to copy data to it and from it.
- ☐cudaMemcpy () transfers memory from the host to device to host and vice versa

```
cudaMemcpy(array_device, array_host,
N*sizeof(float), cudaMemcpyHostToDevice);
```

```
cudaMemcpy(array_host, array_device,
N*sizeof(float), cudaMemcpyDeviceToHost);
```

- ☐ First argument is always the **destination** of transfer
- ☐ Transfers are relatively slow and should be minimised where possible





```
#define N 2048
#define THREADS PER BLOCK 128
global void vectorAdd(float *a, float *b, float *c) {
 int i = blockIdx.x * blockDim.x + threadIdx.x;
 c[i] = a[i] + b[i];
int main(void) {
    float *a, *b, *c; // host copies of a, b, c
    float *d a, *d b, *d c; // device copies of a, b, c
    int size = N * sizeof(float);
    cudaMalloc((void **)&d a, size);
    cudaMalloc((void **)&d b, size);
    cudaMalloc((void **)&d c, size);
    a = (float *)malloc(size); random floats(a, N);
    b = (float *)malloc(size); random floats(b, N);
    c = (float *) malloc(size);
    cudaMemcpy(d a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d b, b, size, cudaMemcpyHostToDevice);
    vectorAdd <<<N / THREADS PER BLOCK, THREADS PER BLOCK >>>(d a, d b, d c);
    cudaMemcpy(c, d c, size, cudaMemcpyDeviceToHost);
    free(a); free(b); free(c);
    cudaFree(d a); cudaFree(d b); cudaFree(d c);
    return 0;
```

Define macros

Define kernel

Define pointer variables

Allocate GPU memory

Allocate host memory and initialise contents

Copy input data to the device

Launch the kernel

Copy data back to host

Clean up





### Device Synchronisation

- ☐ Kernel calls are non-blocking
  - ☐ Host continues after kernel launch
  - □Overlaps CPU and GPU execution
- ☐cudaDeviceSynchronise() call be called from the host to block until GPU kernels have completed

```
vectorAdd<<<br/>blocksPerGrid, threadsPerBlock>>>(a, b, c);
//do work on host (that doesn't depend on c)
cudaDeviceSynchronise(); //wait for kernel to finish
```

- ☐ Standard cudaMemcpy calls are blocking
  - □ Non-blocking variants exist





□CUDA Programming Model

☐ CUDA Device Code

□ CUDA Host Code and Memory Management

□ CUDA Compilation



## Compiling a CUDA program

☐ CUDA C Code is compiled using **nvcc** e.g.

☐ Will compile host AND device code to produce an executable

nvcc -o example example.cu



### Compilation

- □CUDA source file (\*.cu) are compiled by nvcc
- ☐ An existing cuda.rules file creates property page for CUDA source files
  - □Configures nvcc in the same way as configuring the C compiler
  - □Options such as optimisation and include directories can be inherited from project defaults
- ☐C and C++ files are compiled with gcc



#### Device Versions

- □ Different generations of NVIDIA hardware have different compatibility □ These are classified by CUDA compute versions □ Compilation normally builds for CUDA compute version 2
  - ☐This can be changed by passing —arch to nvcc
  - □Default value is "compute 20, sm 20"
  - □E.g. nvcc source.cu -arch=compute\_20,sm\_20
  - ☐ Any hardware with greater than the compiled compute version can execute the code (backwards compatibility)
- ☐You can build for multiple versions using separator
  - □ E.g. "compute\_20, sm\_20; compute\_30, sm\_30; compute\_35, sm\_35"
  - ☐ This will increase build time and execution file size
  - ☐ Runtime will select the best version for your hardware





### Summary

- □CUDA is a C like programming language
- ☐ Programming a GPU requires moving data to and from the device
- ☐ Parallel regions are executed using Kernel
- ☐ Kernels require high levels of parallelism
  - ☐ Exposed as many threads grouped into blocks
  - ☐ Thread blocks are mapped to SMs
- ☐ Host and device code are compiled separately and linked into a single executable





#### Hands on Session

- □ Navigate to <a href="https://nvlabs.qwiklab.com">https://nvlabs.qwiklab.com</a>
  - ☐ Same account as for Alison Lowndes labs
  - □Class name: Deep Learning, Big Data and Big Compute Camp (Rabat)
- ☐Get the lab handout from
  - http://paulrichmond.shef.ac.uk/teaching/NVIDIA/rabat/

or

□http://bit.ly/2eyaSEj



